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Practical considerations 
Rmpi 

metaMix uses the Rmpi package (http://cran.r-project.org/web/packagcs/Rmpi/). 
Rmpi is the R wrapper to MPI, the Message-Passing Interface. MPI is the stan- 
dardized and portable system allowing parallel programs communicate with each 
other. 

Number of chains and tempering scheme 

The optimal number of chains used in the parallel tempering is not obvious. 
The main idea is that the number of chains used must be large enough to 
ensure successful swaps between all neighboring chains. The first limitation is 
the number of chains we can run on a computer. The computing facility we use 
for all analyses is the UCL CS cluster. The availability of suitable machines as 
well as considerations towards minimizing the queuing time of the submitted 
jobs, defines the maximum number we can use as N — 12 chains. 

The choice of temperature values is motivated by the fact that these must 
not be too far apart, so that exchange of values between the chains can occur. 
Additionally the maximum value must be high enough so that no chains become 
trapped in local minima, hence allowing for global moves. 

We implemented a power decay heating scheme: 

t n = (t n -i - K) a , where n = 2,...,N, Ke(0, 1) and a > 1 (1) 

and using K = 0.001 and a = 3/2 we achieve a slowly heating sequence of 
chains with a lot of chains similar to the target. 

We find that for N < 10 the maximum temperature is not very high, hinder- 
ing a quick global exploration. Ideally we would prefer to run 14 to 20 chains 
but given our computing constraint N = 12 performs satisfactorily. 

Number of iterations 

Given the described setup of our Parallel Tempering, we find that the MCMC 
produces reasonable results in (5 x number of potential species) iterations for 
each chain. 

Relative abundance estimation 

We estimate the relative abundances w for a given set of species, using either a 
frcqucntist - the Expectation-Maximization algorithm - or a Bayesian approach 
- the Gibbs Sampler. 

Let us recall the metagenomic mixture model notation introduced in the 
main text. The data consist of N sequencing reads X = (x\, . . . , Xn)- The 
relative abundance proportions, that is the proportion of each of the k species 
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in the mixture is denoted by w — (wi, Wk) ■ The probability of observing 
the read X{ conditional on the assumption that it originated from species Sj is 
p(x l \x l from Sj) = Pij. 

In the mixture model setting each read x il 1 < i < n is assumed to arise 
from a specific but unknown component of the mixture. The mixture structure 
is deconvoluted by the introduction of latent variables: we associate Xi with 
Zi = (zn, . . . ,Zik), a fc-dimensional vector indicating to which component Xi 
belongs. 

{1 if Xi belongs to class j , , 

0 otherwise 

Expectation Maximization approach 

The EM approach to parameter estimation obtains point estimates of the pa- 
rameters by maximizing the likelihood [1]. 

The algorithm iterations consist of two steps. In the first step, the expected 
value of the missing variables Zi is computed based on p(z\x, w) . In the next 
step we calculate the new mixing parameters w that maximize the expected 
complete-data log likelihood 

E[lnp(X, z\w)] = 'y^ j p(z\x, w) lnp(x, z\w) (3) 

z 

with the complete-data likelihood given by 

n k 

p(X,z\w) = Hl[(w j p ij y* (4) 

i=l j=l 



EM algorithm 



• Initialization k/°) 

• At iteration t 

1. Expectation step. 

Generate zf 1 from p{z^ = j\xi,w^ ^). 

. _ P{Zi =j,Xi\w) _ p(zi=j\w)p(xi\zi=j,w) _ Wjp t:j 



p(Xi\w) T, k j=lP( z i = j\w)p(Xi\Zi =j,w) Ylj=l w jPij 

(5) 

2. Maximization step. 

Given Zi from E-step, calculate new w^> that maximize the expec- 
tation of the complete-data log-likelihood (eq.3). 
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w 



(*) 



argmaxE[lnp(A, z\w)\ (6) 



It can be shown that this 



v^ n (*) 



Gibbs sampling approach 

The Gibbs sampler is a Markov Chain Monte Carlo method based on the suc- 
cessive simulation of z and w [2] . After convergence we obtain the full posterior 
distribution of w. 



Gibbs sampler 



1. Initialization u/°) 

2. At iteration t 

1. Generate zf 1 from yizf 1 = j\xi,w^ ^). So 

Zl ~ Mult{l;z ll ( t - 1 \...,zl k ^) (8) 
where z\j is given by equation 5. 

2. Compute nf =Eti4f- 

3. Generate from 

tt(w\zW) ~ D(ai + nf \ . . . , a fe + 4* } ) (9) 



(9) can be explained as 7r(w|z) cx 7r(z|iu)7r(i«), where for the mixing parameters 
w a conjugate prior n(w) is the Dirichlet distribution with parameters a = 
{ot\ , Qffe}. Additionally: 

n n n k k 

n(z\ W ) = n A*iW = n < x ■ ■ ■ < ik = n n *>? = n w ? ^ 

i—l i i—lj — l j — 1 

hence sampling w from Dir <~ («i + m, . . . , a k + rik) ■ 
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Importance Sampling, Defensive Sampling, MLE approxi- 
mation 



Accounting for the uncertainty in mixture weights w, a Monte Carlo approxi- 
mation can be used for the marginal likelihood, by drawing independent samples 
from the prior to estimate P(X) and averaging the likelihood. The simulation 
from the prior is computationally inefficient, as the majority of samples arc out- 
side the regions of high likelihood. Importance sampling (IS) techniques can 
be used to reduce the variance of the estimator. To obtain an efficient IS pro- 
posal distribution, we first approximate the posterior distribution P(w\X) with 
a normal multivariate distribution g, based on the samples generated by the 
Gibbs sampler. We then generate n samples j/j <~ J\f(n, S), 1 < i < n. Defining 
the IS weights a» = -^4, the IS estimator of the marginal likelihood is then: 



The method works well if the importance distribution is fairly similar to the 
target distribution but with heavier tails. Using the posterior approximation as 
the importance distribution means that the IS estimator becomes the harmonic 
mean estimator (HME). HME is known to be unstable and to overestimate the 
marginal likelihood P(X). In order to overcome this issue and to make the 
distribution tails heavier, we perform defensive importance sampling by using a 
mixture of posterior and prior as the IS distribution [3]. This approach is only 
slightly costlier in computational time compared to the typical IS. 

As discussed in the main text, for the IS we have approximated the posterior 
distribution P(w\X) with a normal multivariate distribution g, based on the 
output of the Gibbs sampler. However the posterior approximation g as IS 
proposal distribution means that the marginal likelihood estimator becomes 
the Harmonic Mean estimator. This estimator can be unstable; a solution to 
this problem is the defensive importance sampling [3]. The main idea is the 
incorporation of a heavy tail component in the importance function g , effectively 
substituting it by the mixture density: 



where A is close to 1. A natural choice for q as the stabilizing factor is the 
prior density ir. 

In practice that means that the samples we use for the defensive IS estimator 
are generated with probability A from g and with probability 1 — A from n . 

Comparison of the 3 approaches 

We compared the performance of metaMix on the same simHC FAMcS dataset, 
using Importance Sampling and Defensive Importance Sampling (95% samples 
produced by posterior approximating g and 5% by tt) for the marginal likelihood 
estimation as well as using the MLE approximation. The latter are the results 



I = 



(ii) 



\g(y) + (l-\)q(y),0<\<l 



(12) 
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presented in the main text. For the IS and the defensive IS at each MCMC 
iteration 1,000 samples were drawn from the proposal distribution. 

The resulting species profiles can be seen in table S3. We ran the MCMC for 
1,000 iterations in order to obtain results within 24 hours. All three approaches 
produced almost identical results, in terms of species identified and abundance 
estimates, with the defensive IS performing a bit better in terms of abundance 
estimation accuracy. However the MLE approximation method was ^13x times 
faster than the other two, reducing the time required from ~19 hours to 90 
minutes. 



Table 1: FAMeS simHC - comparing the effect of different marginal likelihood esti- 
mation methods on metaMix performance: species profiling, accuracy of abundance 
estimation and computational time. 





Importance 


Defensive 


MLE 




Sampling 


Sampling 


approximation 


Number of species 


116 


116 


116 


w estimate - rRMSE 


17 


16.8 


17.1 


w estimate - AVGRE 


8.5 


8.3 


8.6 


Computational time (hours) 


18.6 


18.6 


1.5 



5 



References 

[1] Dempster, A. and Laird, N. (1977) Maximum likelihood from incomplete 
data via the EM algorithm. Journal of the Royal Statistical Society., 39(1), 
1-38. 

[2] Diebolt, J. and Robert, C. P. (1994) Estimation of Finite Mixture Distri- 
butions through Bayesian Sampling. Journal of the Royal Statistical Society 
Series B Methodological, 56(2), 363-375. 

[3] Hesterberg, T. (1995) Weighted average importance sampling and defensive 
mixture distributions. Technometrics, 37(2), 185-194. 



G 



